Scenario 1: sampling minibatches from fully observed datasets

To perform online iNMF, we need to install the online branch. Please see the instruction below.

library(devtools)
install_github("MacoskoLab/liger", ref = "online")

We first create a liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here.

library(liger)
pbmcs = createLiger(list(stim = "stim_PBMCs.h5", ctrl = "ctrl_PBMCs.h5"))

We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.

pbmcs = normalize(pbmcs)
pbmcs = selectGenes(pbmcs, var.thresh = 0.2, do.plot = F)
pbmcs = scaleNotCenter(pbmcs)

Online Integrative Nonnegative Matrix Factorization

Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000).

pbmcs = online_iNMF(pbmcs, k = 20, max.epochs = 5)

Quantile Normalization and Downstream Analysis

After performing the factorization, we can perform quantile normalization to align the datasets.

pbmcs = quantile_norm(pbmcs)

We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.

pbmcs = runUMAP(pbmcs)
plotByDatasetAndCluster(pbmcs, axis.labels = c("UMAP1","UMAP2"))

We can also compare clusters or datasets (within each cluster) to identify differentially expressed genes. The runWilcoxon function performs differential expression analysis by sampling a specified number of cells from the dataset on disk, then performing an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory.

de_genes = runWilcoxon(pbmcs, compare.method = "clusters", max.sample = 5000)
head(de_genes)
##         feature group   avgExpr        logFC statistic       auc      pval
## 1    AL627309.1     1 -23.02585 -0.006801316  746300.5 0.4997864 0.7123956
## 2 RP11-206L10.2     1 -23.02585 -0.003373260  746460.0 0.4998932 0.7946840
## 3     LINC00115     1 -23.02585 -0.089421497  742313.0 0.4971160 0.1738807
## 4         NOC2L     1 -21.44847  0.502855594  770801.5 0.5161943 0.0294685
## 5        KLHL17     1 -22.97600  0.033192611  748163.0 0.5010337 0.3023024
## 6       PLEKHN1     1 -23.02585 -0.071678476  743110.5 0.4976501 0.2198825
##        padj pct_in pct_out
## 1 0.8712779    100     100
## 2 0.8985167    100     100
## 3 0.4964537    100     100
## 4 0.1521384    100     100
## 5 0.6598760    100     100
## 6 0.5574626    100     100

Scenario 2: iterative refinement by incorporating new datasets

We can also perform online iNMF with continually arriving datasets.

MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp = selectGenes(MOp, var.thresh = 2)
MOp.vargenes = MOp@var.genes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

MOp2 = createLiger(list(nuclei = "allen_smarter_nuclei.h5"))
MOp2 = normalize(MOp2)
MOp2@var.genes = MOp@var.genes
MOp2 = scaleNotCenter(MOp2)
MOp = online_iNMF(MOp, X_new = list(nuclei = "allen_smarter_nuclei.h5"), k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

Scenario 3: projecting new datasets

MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp@var.genes = MOp.vargenes
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

MOp = online_iNMF(MOp, X_new = list(nuclei = "allen_smarter_nuclei.h5"), k = 40, project = TRUE)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))